Skip to main content

Full text of "Bayesian Online Changepoint Detection"

See other formats


Bayesian Online Changepoint Detection 



Ryan Prescott Adams 

Cavendish Laboratory 
Cambridge CBS OHE 
United Kingdom 



David J.C. MacKay 

Cavendish Laboratory 
Cambridge CBS OHE 
United Kingdom 



Abstract 

Changepoints are abrupt variations in the 
generative parameters of a data sequence. 
OnUne detection of changepoints is useful in 
modehing and prediction of time series in 
apphcation areas such as finance, biomet- 
rics, and robotics. While frequentist meth- 
ods have yielded online filtering and predic- 
tion techniques, most Bayesian papers have 
focused on the retrospective segmentation 
problem. Here we examine the case where 
the model parameters before and after the 
changepoint are independent and we derive 
an online algorithm for exact inference of the 
most recent changepoint. We compute the 
probability distribution of the length of the 
current "run," or time since the last change- 
point, using a simple message-passing algo- 
rithm. Our implementation is highly modu- 
lar so that the algorithm may be applied to 
a variety of types of data. We illustrate this 
modularity by demonstrating the algorithm 
on three different real-world data sets. 



1 INTRODUCTION 

Changepoint detection is the identification of abrupt 
changes in the generative parameters of sequential 
data. As an online and offline signal processing tool, it 
has proven to be useful in applications such as process 
control [1], EEC analysis [5, 2, 17], DNA segmentation 
[6], econometrics [7, 18], and disease demographics [9]. 

Frequentist approaches to changepoint detection, from 
the pioneering work of Page [22, 23] and Lorden [19] 
to recent work using support vector machines [10], of- 
fer online changepoint detectors. Most Bayesian ap- 
proaches to changepoint detection, in contrast, have 
been offline and retrospective [24, 4, 26, 13, 8]. With a 



few exceptions [16, 20] , the Bayesian papers on change- 
point detection focus on segmentation and techniques 
to generate samples from the posterior distribution 
over changepoint locations. 

In this paper, we present a Bayesian changepoint de- 
tection algorithm for online inference. Rather than 
retrospective segmentation, we focus on causal predic- 
tive filtering; generating an accurate distribution of the 
next unseen datum in the sequence, given only data al- 
ready observed. For many applications in machine in- 
telligence, this is a natural requirement. Robots must 
navigate based on past sensor data from an environ- 
ment that may have abruptly changed: a door may 
be closed now, for example, or the furniture may have 
been moved. In vision systems, the brightness change 
when a light switch is flipped or when the sun comes 
out. 

We assume that a sequence of observations 
xi,X2,---,xt may be divided into non-overlapping 
product partitions [3]. The delineations between 
partitions are called the changepoints. We further 
assume that for each partition p, the data within it 
are i.i.d. from some probability distribution P{xt \ rjp). 
The parameters rjp, p — 1,2,... are taken to be i.i.d. 
as well. We denote the contiguous set of observations 
between time a and h inclusive as Xaj). The discrete 
a priori probability distribution over the interval 
between changepoints is denoted as Pgapig)- 

We are concerned with estimating the posterior dis- 
tribution over the current "run length," or time since 
the last changepoint, given the data so far observed. 
We denote the length of the current run at time t by 
rt- We also use the notation xl ' to indicate the set 
of observations associated with the run rt. As r may 
be zero, the set x'^'^^ may be empty. We illustrate the 
relationship between the run length r and some hypo- 
thetical univariate data in Figures 1(a) and 1(b). 



t. 


Qz 


93 


• • 




• 

• 


• • 




H — \ — \ — h 


H — \ — \ — \ — \ — h 


H — ^ — ^ — \ — ► 





(a) 




1 2 3 4 5 6 7 8 9 10 11 12 13 14 t 

(b) 




Figure 1: This figure illustrates how we describe a 
changepoint model expressed in terms of run lengths. 

Figure 1 (a) shows hypothetical univariate data divided 
by changepoints on the mean into three segments of 
lengths gi = 4, 52 = 6, and an undetermined length 
513. Figure 1(b) shows the nm length rj as a function 
of time, ft drops to zero when a changepoint occurs. 
Figure 1(c) shows the trellis on which the message- 
passing algorithm lives. Solid lines indicate that prob- 
ability mass is being passed "upwards," causing the 
run length to grow at the next time step. Dotted lines 
indicate the possibility that the current run is trun- 
cated and the run length drops to zero. 



2 RECURSIVE RUN LENGTH 
ESTIMATION 

We assume that we can compute the predictive distri- 
bution conditional on a given run length rt- We then 
integrate over the posterior distribution on the current 
run length to find the marginal predictive distribution: 



P{Xt+l I Xi-t) 



Y,P{xt+i\rt,x['-^)P{n\x,..t) (1) 



To find the posterior distribution 

P{rt,xi:t) 



Pin I Xl:t) = 



P{xv.t) 



(2) 



we write the joint distribution over run length and 
observed data recursively. 

P{rt,Xi:L} = ^ P{rt,rt-i,xi:t) 

rt-i 

= P{rt,xt\rt-i,xi:t-i)P{rt-i,xi:t-i) (3) 

rt-l 

= ^ P{rt I rt-i)Pixt I rt-i,xi''>)P{rt-i,Xi..t-i) 
n-i 

Note that the predictive distribution P{xt \rt-i,Xi;t) 

(r) 

depends only on the recent data x). . We can thus 

generate a recursive message-passing algorithm for the 
joint distribution over the current run length and the 
data, based on two calculations: 1) the prior over 
given rt-i, and 2) the predictive distribution over the 
newly-observed datum, given the data since the last 
changepoint. 

2.1 THE CHANGEPOINT PRIOR 

The conditional prior on the changepoint P{rt \rt-i) 
gives this algorithm its computational efficiency, as it 
has nonzero mass at only two outcomes: the run length 
either continues to grow and rt = rt-i -|- 1 or a change- 
point occurs and rt = 0. 



P{rt\rt-i) 




if rt = 

if rt = rt-i + 1 

otherwise 



The function -ff(r) is the hazard function. [11]. 



Hir) = 



■Pgap(ff=T) 



(4) 



(5) 



In the special case is where Pgap(.g) is a discrete expo- 
nential (geometric) distribution with timescale A, the 
process is memoryless and the hazard function is con- 
stant at H{t) = 1/A. 

Figure 1(c) illustrates the resulting message- passing 
algorithm. In this diagram, the circles represent run- 
length hypotheses. The lines between the circles show 
recursive transfer of mass between time steps. Solid 
lines indicate that probability mass is being passed 
"upwards," causing the run length to grow at the next 
time step. Dotted lines indicate that the current run 
is truncated and the run length drops to zero. 



1. Initialize 

P(ro) = S{r) or P(ro=0) = 1 

(0) _ 

(0) 

1 A. prior 

2. Observe New Datum xt 

3. Evaluate Predictive Probability 

4. Calculate Growth Probabilities 

P(rt=rt_i+l,a;i:t) = P{rt-i,xv.t-i)4"\^-H{rt-i)) 

5. Calculate Changepoint Probabilities 

P{rt=0,xv.t) = J2P{rt-i,xv.t-i)ni^^H{rt-i) 

rt-l 

6. Calculate Evidence 

P(a;i:t) =^P(rt,a^i:t) 

rt 

7. Determine Run Length Distribution 

P{rt\xi..t) = P{rt,xi..t)/P{xi:t) 

8. Update Sufficient Statistics 

(0) 

V - . Xprior 

9. Perform Prediction 
10. Return to Step 2 



2.2 BOUNDARY CONDITIONS 

A recursive algorithm must not only define the recur- 
rence relation, but also the initialization conditions. 
We consider two cases: 1) a changepoint occurred a 
priori before the first datum, such as when observing 
a game. In such cases we place all of the probability 
mass for the initial run length at zero, i.e. P(ro=0) = 1. 
2) We observe some recent subset of the data, such as 
when modelling climate change. In this case the prior 
over the initial run length is the normalized survival 
function [11] 

P(ro=T) = ^S{t), (6) 
where Z is an appropriate normalizing constant, and 

oo 

S{r) = E ^gap(5=t). (7) 

t=T+l 

2.3 CONJUGATE-EXPONENTIAL 
MODELS 

Conjugate-exponential models arc particularly conve- 
nient for integrating with the changepoint detection 
scheme described here. Exponential family likelihoods 
allow inference with a finite number of sufficient statis- 
tics which can be calculated incrementally as data ar- 
rives. Exponential family likelihoods have the form 

P{x\r]) = h{x)exp{r]''U{x) - A{r])) (8) 

where 

A{r]) = log j dr) h{x) exp (r/T[/(a;)) . (9) 

The strength of the conjugate-exponential representa- 
tion is that both the prior and posterior take the form 
of an exponential-family distribution over rj that can 
be summarized by succinct hyperparameters v and %. 

P{n\x, i") = Hv) exp (ry^x - i'A(ry) - A{x, 

(10) 

We wish to infer the parameter vector rj associated 
with the data from a current run length rt ■ We denote 

(r) 

this run-specific model parameter as rjl '. After find- 

(r) {r)\ 

ing the posterior distribution P{ril \rt,x^ ), we can 
marginalize out the parameters to find the predictive 
distribution, conditional on the length of the current 
run. 

P{xt+i \rt) = J dr, P{xt+i I r,)Piv['^^ \ r*, a;*'')) 

(11) 



Algorithm 1: The online changepoint algorithm with 
prediction. An additional optimization not shown is 

to truncate the per-timestep vectors when the tail of 
Pirtlxiit) has mass beneath a threshold. 



This marginal predictive distribution, while gener- 
ally not itself an exponential-family distribution, is 
usually a simple function of the sufficient statistics. 
When exact distributions are not available, compact 
approximations such as that described by Snelson and 
Ghahramani [25] may be useful. We will only ad- 
dress the exact case in this paper, where the predictive 
distribution associated with a particular current run 

(r) (r) 

length is parameterized by and Xt ■ 

v't^ = Vpnox -^rt (12) 



xle5 








1 


10-^ 




10-^ 2 








10-* 




10^ 



1700 



1900 



2300 



2500 



2700 



2100 
Time 

Figure 2: The top plot is a 1100-datum subset of nuclear magnetic response during the drilling of a well. The 
data are plotted in light gray, with the predictive mean (solid dark line) and predictive 1-a error bars (dotted 
lines) overlaid. The bottom plot shows the posterior probability of the current run P{rt \ xi;t) at each time step, 
using a logarithmic color scale. Darker pixels indicate higher probability. 



2.4 COMPUTATIONAL COST 

The complete algorithm, assuming exponential-family 
likelihoods, is shown in Algorithm 1. The space- and 
time-complexity per time-step are linear in the number 
of data points so far observed. A trivial modification of 
the algorithm is to discard the run length probability 
estimates in the tail of the distribution which have a 
total mass less than some threshold, say 10"''. This 
yields a constant average complexity per iteration on 
the order of the expected run length E[r], although 
the worst-case complexity is still linear in the data. 

3 EXPERIMENTAL RESULTS 

In this section we demonstrate several implementa- 
tions of the changepoint algorithm developed in this 
paper. We examine three real-world example datasets. 
The first case is a varying Gaussian mean from well- 
log data. In the second example we consider abrupt 
changes of variance in daily returns of the Dow Jones 
Industrial Average. The final data are the intervals 
between coal mining disasters, which we model as a 
Poisson process. In each of the three examples, we use 
a discrete exponential prior over the interval between 
changepoints. 

3.1 WELL-LOG DATA 



rock surrounding the well. The variations in mean 
reflect the stratification of the earth's crust. These 
data have been studied in the context of changepoint 
detection by O Ruanaidh and Fitzgerald [21], and by 
Fearnhead and Chfford [12]. 

The changepoint detection algorithm was run on these 
data using a univariate Gaussian model with prior pa- 
rameters n = 1.15 X 10^ and a = Ix 10"'. The rate of 
the discrete exponential prior, Agap, was 250. A sub- 
set of the data is shown in Figure 2, with the predic- 
tive mean and standard deviation overlaid on the top 
plot. The bottom plot shows the log probability over 
the current run length at each time step. Notice that 
the drops to zero run-length correspond well with the 
abrupt changes in the mean of the data. Immediately 
after a changepoint, the predictive variance increases, 
as would be expected for a sudden reduction in data. 

3.2 1972-75 DOW JONES RETURNS 

During the three year period from the middle of 1972 
to the middle of 1975, several major events occurred 
that had potential macroeconomic effects. Significant 
among these are the Watergate affair and the OPEC 
oil embargo. We applied the changepoint detection 
algorithm described here to daily returns of the Dow 
Jones Industrial Average from July 3, 1972 to June 30, 
1975. We modelled the returns 



These data are 4050 measurements of nuclear magnetic 
response taken during the drilling of a well. The data 
are used to interpret the geophysical structure of the 



(14) 






500 




400 






□1 


300 


3 






200 


i 


IQO 







1 


10-^ 




10'* .w; 




10^1 




10 ■ 




10"^ 



30 January 1973: 
Cbnvitt^d 



19 October 1973: 
apEli. Erntujrgs Begins 



9 August 1974: 



Figure 3: The top plot shows daily returns on the Dow Jones Industrial Average, with an overlaid plot of the 
predictive volatility. The bottom plot shows the posterior probability of the current run length P{rt \ Xi:t) at 
each time step, using a logarithmic color scale. Darker pixels indicate higher probability. The time axis is in 
business days, as this is market data. Three events are marked: the conviction of G. Gordon Liddy and James 
W. McCord, Jr. on January 30, 1973; the beginning of the OPEC embargo against the United States on October 
19, 1973; and the resignation of President Nixon on August 9, 1974. 



(where is the daily closing price) with a zero- 

mean Gaussian distribution and piecewise-constant 
variance. Hsu [14] performed a similar analysis on a 
subset of these data, using frequentist techniques and 
weekly returns. 

We used a gamma prior on the inverse variance, with 
a = 1 and b = 10~^. The exponential prior on change- 
point interval had rate Agap = 2-50. In Figure 3, the 4 DISCUSSION 
top plot shows the daily returns with the predictive 
standard deviation overlaid. The bottom plot shows 
the posterior probability of the current run length, 
P{rt\xi:t). Three events are marked on the plot: 
the conviction of Nixon re-election officials G. Gordon 
Liddy and James W. McCord, Jr., the beginning of 
the oil embargo against the United States by the Or- 
ganization of Petroleum Exporting Countries (OPEC), 
and the resignation of President Nixon. 

3.3 COAL MINE DISASTER DATA 

These data from Jarrett [15] are dates of coal mining 
explosions that killed ten or more men between March 
15, 1851 and March 22, 1962. We modelled the data 
as an Poisson process by weeks, with a gamma prior 
on the rate with a = b = 1. The rate of the exponen- 



Possion process determines the local average of the 
slope. The posterior probability of the current run 
length is shown in the bottom plot. The introduc- 
tion of the Coal Mines Regulations Act in 1887 (cor- 
responding to weeks 1868 to 1920) is also marked. 



This paper contributes a predictive, online interpeta- 
tion of Bayesian changepoint detection and provides 
a simple and exact method for calculating the poste- 
rior probability of the current run length. We have 
demonstrated this algorithm on three real-world data 
sets with different modelling requirements. 

Additionally, this framework provides convenient de- 
lineation between the implementation of the change- 
point algorithm and the implementation of the model. 
This modularity allows changepoint-detection code to 
use an object-oriented, "pluggable" type architecture. 



tial prior on the changepoint inverval was Agap = 1000. 
The data are shown in Figure 4. The top plot shows 
the cumulative number of accidents. The rate of the 



Acknowledgements 

The authors would like to thank Phil Cowans and Mar- 
ian Frazier for valuable discussions. This work was 
funded by the Gates Cambridge Trust. 



B 200 




3000 
Weeks 



5000 



Coal Mines 



Regulations Act 

Figure 4: These data are the weekly occurrence of coal mine disasters that killed ten or more people between 
1851 and 1962. The top plot is the cumulative number of accidents. The accident rate determines the local 
average slope of the plot. The introduction of the Coal Mines Regulations Act in 1887 is marked. The year 1887 
corresponds to weeks 1868 to 1920 on this plot. The bottom plot shows the posterior probability of the current 
run length at each time step, P{rt \ Xi-t). 



References 

[1] Leo A. Aroian and Howard Levene. The effec- 
tiveness of quality control charts. Journal of the 
American Statistical Association, 45(252):520- 
529, 1950. 

[2] J. S. Barlow, O. D. Creutzfeldt, D. Michael, 
J. Houchin, and H. Epelbaum. Automatic 
adaptive segmentation of clinical EEGs,. Elec- 
troencephalography and Clinical Neurophysiology, 
51(5):512-525, May 1981. 

[3] D. Barry and J. A. Hartigan. Product partition 
models for change point problems. The Annals of 
Statistics, 20:260-279, 1992. 

[4] D. Barry and J. A. Hartigan. A Bayesian analysis 
of change point problems. Journal of the Ameri- 
can Statistical Association, 88:309-319, 1993. 

[5] G. Bodenstein and H. M. Practorius. Feature ex- 
traction from the electroencephalogram by adap- 
tive segmentation. Proceedings of the IEEE, 
65(5):642-652, 1977. 

[6] J. V. Braun, R. K. Braun, and H. G. Miiller. 
Multiple changepoint fitting via quasilikelihood, 
with application to DNA sequence segmentation. 
Biometrika, 87(2):301-314, June 2000. 



[7] Jie Chen and A. K. Gupta. Testing and locating 
variance changepoints with application to stock 
prices. Journal of the American Statistical Asso- 
ciation, 92(438) :739-747, June 1997. 

[8] Siddhartha Chib. Estimation and comparison of 
multiple change-point models. Journal of Econo- 
metrics, 86(2):221-241, October 1998. 

[9] D. Denison and C. Holmes. Bayesian partitioning 
for estimating disease risk, 1999. 

[10] F. Desobry, M. Davy, and C. Doncarli. An on- 
line kernel change detection algorithm. IEEE 
Transactions on Signal Processing, 53(8):2961- 
2974, August 2005. 

[11] Merran Evans, Nicholas Hastings, and Brian 
Peacock. Statistical Distributions. Wiley- 
Interscience, June 2000. 

[12] Paul Fearnhead and Peter Clifford. On-line in- 
ference for hidden Markov models via particle fil- 
ters. Journal of the Royal Statistical Society B, 
65(4):887-899, 2003. 

[13] P. Green. Reversible jump Markov chain Monte 
Carlo computation and Bayesian model determi- 
nation, 1995. 

[14] D. A. Hsu. Tests for variance shift at an un- 
known time point. Applied Statistics, 26(3) :279- 
284, 1977. 



[15] R. G. Jarrctt. A note on the intervals between 
coal-mining disasters. Biometrika, 66(1):191-193, 
1979. 

[16] Timothy T. Jervis and Stuart I. Jardine. Alarm 
system for wellbore site. United States Patent 
5,952,569, October 1997. 

[17] A. Y. Kaplan and S. L. Shishkin. Application of 
the change-point analysis to the investigation of 
the brain's electrical activity. In B. E. Brodsky 
and B. S. Darkhovsky, editors, Non-Parametric 
Statistical Diagnosis : Problems and Methods, 
pages 333-388. Springer, 2000. 

[18] Gary M. Koop and Simon M. Potter. Forecast- 
ing and estimating multiple change-point models 
with an unknown number of change points. Tech- 
nical report. Federal Reserve Bank of New York, 
December 2004. 

[19] G. Lorden. Procedures for reacting to a change in 
distribution. The Annals of Mathematical Statis- 
tics, 42(6):1897-1908, December 1971. 

[20] J. J. 6 Ruanaidh, W. J. Fitzgerald, and K. J. 
Pope. Recursive Bayesian location of a discon- 
tinuity in time series. In Acoustics, Speech, and, 
Signal Processing, 1991 ICASSP-94., 1994 IEEE 
International Conference on, volume iv, pages 
IV/513-IV/516 vol.4, 1994. 

[21] Joseph .1. K. O Ruanaidh and William J. Fitzger- 
ald. Numerical Bayesian Methods Applied to 
Signal Processing (Statistics and Computing). 
Springer, February 1996. 

[22] E. S. Page. Continuous inspection schemes. 
Biometrika, 41(1/2):100-115, June 1954. 

[23] E. S. Page. A test for a change in a parame- 
ter occurring at an unknown point. Biometrika, 
42(3/4):523-527, 1955. 

[24] A. F. M. Smith. A Bayesian approach to infer- 
ence about a change-point in a sequence of ran- 
dom variables. Biom.etrika, 62(2):407 416, 1975. 

[25] Edward Snelson and Zoubin Ghahramani. Com- 
pact approximations to Bayesian predictive dis- 
tributions. In ICML '05: Proceedings of the 
22nd international conference on Machine learn- 
ing, pages 840-847, New York, NY, USA, 2005. 
ACM Press. 

[26] D. A. Stephens. Bayesian retrospective multiple- 
changepoint identification. Applied Statistics, 
43:159-178, 1994.